[DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) #1049

taurus-forever · 2025-07-15T21:13:31Z

Issue

The previous is_restart_pending() had to wait 15 seconds because of
Patroni's loop_wait default value (10 seconds), which sets how long
Patroni waits before checking the configuration file again to reload it.

Solution

Instead of checking PostgreSQL pending_restart in pg_settings,
check the Patroni API flag pending_restart (True or undefined).
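
As a rough illustration of the idea (not the code from this PR), the check could look like the sketch below. It assumes the Patroni REST API `GET /patroni` member status, where the `pending_restart` key is present only when a restart is pending. The function names, the plain-HTTP `urllib` call and the example base URL are assumptions made for the sketch; the charm may use a different client, TLS and authentication.

```python
import json
from urllib.request import urlopen


def restart_pending_from_status(status: dict) -> bool:
    """Return True only when the status reports pending_restart as true.

    A missing key is treated the same as False, matching the
    "True/undefined" behaviour described above.
    """
    return status.get("pending_restart") is True


def is_restart_pending(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch the member status from the Patroni REST API and check the flag.

    base_url is a hypothetical value such as "http://127.0.0.1:8008".
    """
    with urlopen(f"{base_url}/patroni", timeout=timeout) as response:
        return restart_pending_from_status(json.load(response))
```

Keeping the flag interpretation in a pure function makes it easy to unit-test without a running Patroni.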

Checklist

I have added or updated any relevant documentation.
I have cleaned any remaining cloud resources from my accounts.

codecov · 2025-07-15T21:29:54Z

Codecov Report

❌ Patch coverage is 29.16667% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.40%. Comparing base (ee02d5a) to head (ee8e44b).

Files with missing lines	Patch %	Lines
src/charm.py	32.00%	15 Missing and 2 partials ⚠️
src/cluster.py	28.57%	15 Missing ⚠️
src/relations/async_replication.py	0.00%	2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (29.16%) is below the target coverage (33.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (62.40%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           16/edge    #1049      +/-   ##
===========================================
- Coverage    64.87%   62.40%   -2.47%     
===========================================
  Files           17       17              
  Lines         4270     4272       +2     
  Branches       656      655       -1     
===========================================
- Hits          2770     2666     -104     
- Misses        1333     1440     +107     
+ Partials       167      166       -1


The previous is_restart_pending() had to wait a long time because of Patroni's loop_wait default value (10 seconds), which sets how long Patroni waits before checking the configuration file again to reload it. Instead of checking PostgreSQL pending_restart in pg_settings, let's check the Patroni API pending_restart=True flag.

The current Patroni 3.2.2 has weird/flickering behaviour: on many changes made through the REST API it temporarily sets the flag pending_restart=True, which is gone within a second but lasts long enough to be caught by the charm. Sleeping a bit is a necessary evil until the Patroni 3.3.0 upgrade. The previous code slept for 15 seconds waiting for the pg_settings update. Also, unnecessary restarts could be triggered by a mismatch between the Patroni config file and in-memory changes coming from the REST API, e.g. the slots were undefined in the YAML file but set as an empty JSON {} => None. Update the default template to match the default API PATCHes and avoid restarts.
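
One way to picture the "sleep a bit" workaround is to re-read the flag a few times and treat it as real only if it never clears. This is a minimal sketch, not the PR's implementation: the function name, the number of checks and the delay are all assumptions, and `read_flag` stands in for whatever call reads pending_restart from the Patroni API.

```python
import time
from typing import Callable


def is_restart_pending_stable(
    read_flag: Callable[[], bool],
    checks: int = 3,  # assumed value, not taken from the PR
    delay: float = 0.5,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Report a pending restart only if every consecutive read sees it."""
    for attempt in range(checks):
        if not read_flag():
            # The flag cleared between reads: it was a transient flicker.
            return False
        if attempt < checks - 1:
            sleep(delay)
    return True
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.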

On a topology observer event, the primary unit used to lose its Primary label.

Also:
* use a common logger everywhere
* add several useful log messages (e.g. DB connection)
* remove the no longer necessary debug message 'Init class PostgreSQL'
* align the Patroni API request style everywhere
* add Patroni API duration to debug logs

The list of IPs was randomly sorted, causing unnecessary Patroni configuration regeneration followed by a Patroni restart/reload.

Housekeeping cleanup.

…hanged Those defers are necessary to support scale-up/scale-down during the refresh, but they significantly slow down the PostgreSQL 16 bootstrap (and other daily maintenance tasks, like re-scaling, full node reboot/recovery, etc.). Mute them for now, with a proper documentation record that forbids rescaling during the refresh, until we minimise the number of defers in PG16. Throw a warning so that we recall this promise.

The current PG16 logic relies on Juju update-status or on_topology_change observer events, but in some cases we start Patroni without the Observer, causing a long wait until the next update-status arrives.

It is hard (impossible?) to catch the Juju Primary label manipulations from Juju debug-log. Logging them simplifies troubleshooting.

We had to wait 30 seconds when there was no connection, which is unnecessarily long. Also, add details on the reason for the failed connection (Retry/CannotConnect).

It speeds up single-unit app deployments.

taurus-forever changed the title ~~Use Patroni API for is_restart_pending()~~ [DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) Jul 15, 2025

taurus-forever mentioned this pull request Jul 16, 2025

Test do NOT merge #1050

Closed

taurus-forever force-pushed the alutay/is_restart_pending branch 5 times, most recently from 9c17d76 to e31de74 Compare August 2, 2025 00:40

taurus-forever added 4 commits August 14, 2025 00:42

DPE-7726: Fix topology observer Primary status removal

6c206b3

On a topology observer event, the primary unit used to lose its Primary label.

Mute unit test temporarily

addca8f

taurus-forever force-pushed the alutay/is_restart_pending branch from e31de74 to 1703639 Compare August 13, 2025 23:16

taurus-forever added 10 commits August 14, 2025 01:20

DPE-7726: Add Patroni API logging

db93686

Also:
* use a common logger everywhere
* add several useful log messages (e.g. DB connection)
* remove the no longer necessary debug message 'Init class PostgreSQL'
* align the Patroni API request style everywhere
* add Patroni API duration to debug logs
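
For the last item, a duration log around each Patroni API request could be done with a small wrapper like the one below. This is only a sketch under assumptions: `timed_call`, the log message format and the use of a module-level logger are illustrative, not taken from the charm code.

```python
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def timed_call(label: str, func: Callable[[], T]) -> T:
    """Run func and log how long it took at debug level.

    The duration is logged even if func raises, so slow failures
    are visible in the debug log as well.
    """
    start = time.monotonic()
    try:
        return func()
    finally:
        elapsed = time.monotonic() - start
        logger.debug("Patroni API %s took %.3f s", label, elapsed)
```

A call site would wrap the request, for example `timed_call("GET /patroni", lambda: fetch_status())`, where `fetch_status` is a hypothetical helper.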

DPE-7726: Avoid unnecessary Patroni reloads

4c2211c

The list of IPs was randomly sorted, causing unnecessary Patroni configuration regeneration followed by a Patroni restart/reload.
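
The effect of ordering on config regeneration can be sketched as follows. The helper names and the comma-separated rendering are assumptions for illustration; the charm renders a full Patroni config from a template, but the principle is the same: the same set of peers should always render to the same text.

```python
from typing import Iterable


def render_peers(peer_ips: Iterable[str]) -> str:
    """Render peers in a deterministic order.

    Sorting is lexicographic on the strings, which is enough for
    stability; it is not a numeric IP ordering.
    """
    return ",".join(sorted(set(peer_ips)))


def config_changed(previous: str, peer_ips: Iterable[str]) -> bool:
    """Return True only when the rendered peer list really differs."""
    return render_peers(peer_ips) != previous
```

With this, a reordered but otherwise identical peer list no longer looks like a config change.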

DPE-7726: Remove unnecessary property app_units() and scoped_peer_data()

f134270

Housekeeping cleanup.

DPE-7726: Log Patroni start/stop/restart (to understand charm behavior)

cc4e49c

DPE-7726: Log unit status change to notice Primary label loss

832ed6a

It is hard (impossible?) to catch the Juju Primary label manipulations from Juju debug-log. Logging them simplifies troubleshooting.

DPE-7726: Fixup logs polishing

7bb0baf

DPE-7726: Decrease waiting for DB connection timeout

78db6a2

We had to wait 30 seconds when there was no connection, which is unnecessarily long. Also, add details on the reason for the failed connection (Retry/CannotConnect).
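
A bounded retry that also reports why the connection failed might look like the sketch below. Everything here is an assumption for illustration: the `CannotConnectError` class, the function name, and the 10-second deadline (the commit only says the previous 30 seconds was too long, not what the new value is). The charm itself may rely on a retry library instead of a hand-written loop.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CannotConnectError(Exception):
    """Hypothetical error raised when the database stays unreachable."""


def connect_with_deadline(
    connect: Callable[[], T],
    deadline: float = 10.0,  # assumed value, not taken from the PR
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry connect() until it succeeds or the deadline would be exceeded."""
    last_error: Optional[Exception] = None
    stop_at = time.monotonic() + deadline
    while True:
        try:
            return connect()
        except Exception as error:  # keep the reason for the final message
            last_error = error
        if time.monotonic() + delay > stop_at:
            raise CannotConnectError(
                f"gave up after {deadline} s: {last_error!r}"
            ) from last_error
        sleep(delay)
```

Including the last underlying error in the message is what makes the failure reason visible in the logs.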

DPE-7726: Stop propagating primary_endpoint=None for single unit app

ee8e44b

It speeds up single-unit app deployments.

taurus-forever force-pushed the alutay/is_restart_pending branch from 1703639 to ee8e44b Compare August 13, 2025 23:20
